Abstract

We test, against two different sets of data, an apparently new approach
to the analysis of the variance of a numerical variable which depends on
qualitative characters. We suggest that this approach be used to
complement other existing techniques for studying the interdependence of
the variables involved. According to our method the variance is expressed
as a sum of orthogonal components, obtained as differences of conditional
means, with respect to the qualitative characters. The resulting
expression for the variance depends on the order in which the characters
are considered. We suggest an algorithm which leads to an ordering which
is deemed natural. The first set of data concerns the score achieved by a
population of students on an entrance examination based on a multiple
choice test with 30 questions. In this case the qualitative characters
are dyadic and correspond to a correct or incorrect answer to each
question. The second set of data concerns the delay in obtaining the
degree for a population of graduates of Italian universities. The
variance in this case is analyzed with respect to a set of seven specific
qualitative characters of the population studied (gender, previous
education, working condition, parents' educational level, field of study,
etc.).

1 Introduction and methodology 

Let $X = (x_1, \dots, x_N)$ be a numerical variable defined on a population $P$ of
$N$ individuals. We may think of $X$ as an element of a real vector space $L$
of dimension $N$. We equip $L$ with a real, normalized scalar product: for
$X, Y \in L$, with $Y = (y_1, \dots, y_N)$, we define:

$$\langle X, Y \rangle = \frac{1}{N} \sum_{i=1}^{N} x_i y_i.$$

The length or norm of a vector is defined in terms of the scalar product:

$$\|X\|^2 = \langle X, X \rangle.$$

The mean value of a vector $X$ is of course the scalar

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i.$$

We may also think of the mean value as a vector $E_0(X)$ of $L$ having all its
components equal to the scalar $\bar{X}$. In this context $E_0$ may be thought of as
a linear operator defined on $L$ and mapping $L$ into the subspace of constant
vectors. The variance of $X$ can then be written as:

$$V(X) = \|X - E_0(X)\|^2 = \langle X - E_0(X), X - E_0(X) \rangle.$$
We now suppose that the indices $i = 1, \dots, N$ correspond to individuals of a
population $P$, and that $X$ is a numerical variable defined on the population
$P$. We further suppose that $\pi$ is a partition of the population $P$ into $q$ disjoint
classes $P_1, P_2, \dots, P_q$. Denote by $|P_j|$ the number of elements of $P_j$, so that
$N = |P_1| + \dots + |P_q|$. We can then define a vector $E_\pi(X)$ with components:

$$E_\pi(X)_i = \frac{1}{|P_k|} \sum_{h \in P_k} x_h, \quad \text{for } i \in P_k. \qquad (1)$$

Observe that two components of this vector are identical if their indices
belong to the same class $P_k$ of the partition $\pi$. The trivial identity:

$$X - E_0(X) = E_\pi(X) - E_0(X) + X - E_\pi(X),$$

implies

$$V(X) = \|X - E_0(X)\|^2 = \|E_\pi(X) - E_0(X)\|^2 + \|X - E_\pi(X)\|^2,$$

because, as is easily seen, $E_\pi(X) - E_0(X)$ and $X - E_\pi(X)$ are orthogonal
vectors.
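The identity above can be checked numerically. The following is a minimal sketch in plain Python, not taken from the paper: the helper names (`conditional_mean`, `sq_norm`, `diff`) and the five-element toy population with its two-class partition are assumptions made only for illustration.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def conditional_mean(x, labels):
    """Return E_pi(X): every entry is replaced by the mean of its class."""
    groups = defaultdict(list)
    for value, label in zip(x, labels):
        groups[label].append(value)
    class_mean = {label: mean(vals) for label, vals in groups.items()}
    return [class_mean[label] for label in labels]


def sq_norm(v):
    """Squared norm for the normalized scalar product <X, Y> = (1/N) sum x_i y_i."""
    return sum(t * t for t in v) / len(v)


def diff(a, b):
    return [s - t for s, t in zip(a, b)]


# Hypothetical toy data: five individuals split into classes "a" and "b".
x = [1.0, 3.0, 2.0, 6.0, 8.0]
labels = ["a", "a", "b", "b", "b"]

e0 = [mean(x)] * len(x)              # E_0(X), the constant vector of the mean
e_pi = conditional_mean(x, labels)   # E_pi(X)

total = sq_norm(diff(x, e0))         # V(X)
between = sq_norm(diff(e_pi, e0))    # ||E_pi(X) - E_0(X)||^2
within = sq_norm(diff(x, e_pi))      # ||X - E_pi(X)||^2
```

With these toy values, `total` agrees with `between + within` up to floating point rounding, which is the orthogonal split stated above.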

Suppose now that $\pi_1, \pi_2, \dots, \pi_n$ is a finite sequence of partitions of the
population $P$, into respectively $q_1, q_2, \dots, q_n$ classes. Suppose further that each
partition $\pi_j$ is a refinement of the partition $\pi_{j-1}$. (This means that each class
of the partition $\pi_j$ is contained in a class of the partition $\pi_{j-1}$.) Define for
completeness the trivial partition $\pi_0$ consisting of the full population $P$. Let
$P_k^j$, for $k = 1, \dots, q_j$, be the disjoint classes of the population $P$ relative to the
partition $\pi_j$. With reference to the partition $\pi_j$ define the operator

$$E_j(X) = E_{\pi_j}(X).$$

In this fashion (1) reads:

$$E_j(X)_i = \frac{1}{|P_k^j|} \sum_{h \in P_k^j} x_h, \quad \text{for } i \in P_k^j.$$

Observe that this definition makes sense also in the case $j = 0$. The trivial
identity

$$X - E_0(X) = \sum_{j=1}^{n} [E_j(X) - E_{j-1}(X)] + X - E_n(X), \qquad (2)$$

implies, because of the orthogonality of the terms on the right hand side of
(2),

$$V(X) = \sum_{j=1}^{n} \|E_j(X) - E_{j-1}(X)\|^2 + \|X - E_n(X)\|^2. \qquad (3)$$

We are interested in the case in which the sequence of partitions $\pi_j$ is defined
by a sequence of qualitative characters $C_1, C_2, \dots, C_n$ of the population $P$.
We can define the partition $\pi_j$ by considering the classes of the population
formed by individuals with identical values of the first $j$ characters.
In this case the first $n$ summands on the right hand side of (3) represent
the contributions to the variance of the $n$ qualitative characters $C_1, \dots, C_n$
within the population considered.

Observe however that, while the sum of the first $n$ terms of the right hand
side of (3) is independent of the order in which the characters $C_1, \dots, C_n$
are considered, the operators $E_j$, for $0 < j < n$, are defined with respect
to partitions which strongly depend on the order in which the characters
are taken. As an obvious consequence, the value of each term
$\|E_j(X) - E_{j-1}(X)\|^2$ also depends on the order of the characters. In a different order
the characters would define a different set of partitions; only $\pi_0$ and $\pi_n$, and
consequently $E_0$ and $E_n$, are independent of the chosen order.
We are led therefore to look for a natural order of the qualitative characters
considered. We propose an ordering based on systematic, step by step,
comparisons of the conditional means with respect to the variables considered.
This ordering, which we call Stepwise Optimal Ordering (SOO), is defined as
follows:
We choose the character $C_1$ and the corresponding partition $\pi_1$ which
maximizes $\|E_1(X) - E_0(X)\|^2$. If $C_1, \dots, C_k$ are chosen, the character $C_{k+1}$ is
chosen so that it refines the partition $\pi_k$ into the partition $\pi_{k+1}$ in such a
way that the value $\|E_{k+1}(X) - E_k(X)\|^2$ is largest.
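The greedy rule just described can be sketched in a few lines of plain Python. This is an illustrative reading of the definition, not the authors' code: the function name `stepwise_optimal_ordering`, the dictionary-of-columns input format, the tie-breaking rule (the first candidate encountered wins) and the toy data are all assumptions.

```python
from collections import defaultdict


def conditional_mean(x, keys):
    """E_pi(X) for the partition whose classes are the distinct values of `keys`."""
    groups = defaultdict(list)
    for value, key in zip(x, keys):
        groups[key].append(value)
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    return [means[k] for k in keys]


def sq_dist(a, b):
    """||a - b||^2 under the normalized scalar product."""
    return sum((s - t) ** 2 for s, t in zip(a, b)) / len(a)


def stepwise_optimal_ordering(x, characters):
    """characters maps a name to one qualitative value per individual.

    Returns the chosen order and the terms ||E_k(X) - E_{k-1}(X)||^2.
    """
    remaining = list(characters)
    keys = [()] * len(x)                  # trivial partition pi_0
    current = conditional_mean(x, keys)   # E_0(X)
    order, contributions = [], []
    while remaining:
        best = None
        for name in remaining:
            # Refine the current partition by the candidate character.
            cand_keys = [k + (c,) for k, c in zip(keys, characters[name])]
            cand = conditional_mean(x, cand_keys)
            gain = sq_dist(cand, current)
            if best is None or gain > best[0]:
                best = (gain, name, cand_keys, cand)
        gain, name, keys, current = best
        order.append(name)
        contributions.append(gain)
        remaining.remove(name)
    return order, contributions


# Hypothetical toy data: X = 10 * A + B with two dyadic characters.
characters = {
    "B": [0, 1, 0, 1, 0, 1, 0, 1],
    "A": [0, 0, 1, 1, 0, 0, 1, 1],
}
x = [10 * a + b for a, b in zip(characters["A"], characters["B"])]
order, contributions = stepwise_optimal_ordering(x, characters)
```

In this toy case the two characters determine `x` completely, so the residual term vanishes and the contributions add up to the full variance; `A` is selected first although it is listed second in the input.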

The order Ci,...,C„ determined in this fashion may be considered as a 
ranking of the variables. One should be aware, however, that this ranking 
cannot be interpreted in terms of relative importance in determining the 
phenomenon measured by the variable X. As will be seen in the applications 
below, the qualitative characters considered may be far from independent. 
This may imply that a character which is recognized as a primary cause of
the intensity of the phenomenon measured by X may be mediated by other
characters with which it is associated, and therefore appear in the last
positions of the ranking.

We do not propose a clear cut interpretation of the significance of the
ranking obtained by our method, nor of the relative size of the first n addends
which appear in (3), when the qualitative characters are ordered according
to our prescription. On the contrary, rather than expecting straight answers,
we expect that both the ranking and the relative size of the addends in
the expression (3) would raise questions concerning the dependence of the
variable X on the qualitative variables and the interdependence of the
qualitative variables themselves (with all the cautions regarding the possibility
of considering causal relations between the variables, [21 El El IE]).
Nevertheless, in the very special case considered in the simulated experiment
of Section 4, our method yields a ranking that reflects the relative weight of
the characters.

In the following two sections we apply our method to the two sets of data
mentioned in the abstract and discuss the "ranking" of the qualitative
characters thus obtained. The fourth section is dedicated to a simulated experiment.
We should mention that the ideas contained in Chapter 8D of [5] were
influential in the inception of this work, which started as an attempt to apply
Diaconis' ideas to the case of tree-structured data, under the action of the
group of tree-automorphisms. Under this action the ranges of the operators
$E_j - E_{j-1}$ turn out to be irreducible subspaces of $L$.

2 The score on an entrance examination 

Entering students of the University of Rome "La Sapienza" in scientific and
technical fields take a multiple choice test in mathematics, which consists of
30 questions. The test, in Italian, may be downloaded at [1]. At the moment
the purpose of the test is to discourage students who do not have an adequate
background, and to make students aware of their potential weaknesses.
We consider a population of 2,451 students who took the test in 2005, and
we let X be the score achieved by each student, that is, the number of correct
answers. The variable X depends on the 30 dyadic characters, corresponding
to the correct or incorrect answer to each question. Of course, in this case,
$E_{30}(X) = X$, and

$$V(X) = \|X - E_0(X)\|^2 = \sum_{j=1}^{30} \|E_j(X) - E_{j-1}(X)\|^2.$$

The variable X takes values between 0 and 30. Its mean value is 12.9 and
the variance is V(X) = 29.8. The histogram of X is in Fig. 1.

Figure 1: The histogram of the score.

An application of our method shows that just ten questions, chosen according
to the ranking we propose, "explain" 88% of the variance. In other words, if
we write

$$V(X) = \sum_{j=1}^{10} \|E_j(X) - E_{j-1}(X)\|^2 + \|X - E_{10}(X)\|^2,$$

the remainder term $\|X - E_{10}(X)\|^2 = 3.58$ amounts to just 12% of $V(X) = 29.8$.
We presently list the remainders $\|X - E_k(X)\|^2$, for $k = 1, \dots, 10$, obtained
by applying our method, as a percentage of $V(X)$. To wit, the values
$r_k = \|X - E_k(X)\|^2 / V(X)$:

$$r_1 = \tfrac{\cdots}{100},\quad r_2 = \tfrac{\cdots}{100},\quad r_3 = \tfrac{48}{100},\quad r_4 = \tfrac{40}{100},\quad r_5 = \tfrac{34}{100},$$

$$r_6 = \tfrac{\cdots}{100},\quad r_7 = \tfrac{\cdots}{100},\quad r_8 = \tfrac{\cdots}{100},\quad r_9 = \tfrac{\cdots}{100},\quad r_{10} = \tfrac{12}{100}.$$

We do not claim, of course, that our method necessarily chooses the 10
characters for which $\|X - E_{10}(X)\|^2$ is lowest. In general, with arbitrary
data, this may not be the case.

However, in this particular case, our choice compares well with other
possible choices, as shown by the experiment which we presently describe. We
selected, at random, 300 subsets of ten elements of the original thirty questions
and we computed the conditional mean $E_\pi(X)$ with respect to the partition
$\pi$ obtained by grouping together the students with identical performance on
each of the ten questions chosen. We then computed

$$\|X - E_\pi(X)\|^2, \qquad (4)$$

relative to each ten element choice. The results are summarized in Fig. 2.

Figure 2: Histogram of the values of residual variance (4), as percentage of
total variance, for 300 randomly selected subsets of 10 questions.

Observe that the lowest value of quantity (4), as a fraction of the total
variance, achieved by one of the 300 subsets we selected is higher than 0.14,
while with our choice of a subset of characters we achieved a value of 0.12.
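A comparison of this kind can be sketched as follows in plain Python. The data here are synthetic and the sizes are deliberately small (200 simulated students, 12 questions, subsets of 4 questions); the function name `residual_fraction` and the random seed are likewise assumptions, so the numbers produced have no bearing on the results reported in this section.

```python
import random
from collections import defaultdict


def residual_fraction(x, answers, subset):
    """Quantity (4) divided by V(X), for the partition that groups individuals
    with identical answers on the questions listed in `subset`."""
    n = len(x)
    mean = sum(x) / n
    variance = sum((v - mean) ** 2 for v in x) / n
    groups = defaultdict(list)
    for value, row in zip(x, answers):
        groups[tuple(row[q] for q in subset)].append(value)
    residual = 0.0
    for vals in groups.values():
        m = sum(vals) / len(vals)
        residual += sum((v - m) ** 2 for v in vals)
    return residual / n / variance


rng = random.Random(0)
n_students, n_questions = 200, 12   # toy sizes, not the paper's 2,451 and 30
answers = [[rng.randint(0, 1) for _ in range(n_questions)]
           for _ in range(n_students)]
scores = [sum(row) for row in answers]   # score = number of correct answers

# Residual fraction for 50 randomly chosen subsets of 4 questions.
fractions = [
    residual_fraction(scores, answers, rng.sample(range(n_questions), 4))
    for _ in range(50)
]

# Conditioning on every question reproduces the score exactly.
full = residual_fraction(scores, answers, range(n_questions))
```

The smallest entry of `fractions` plays the role of the best random subset; it can then be set against the residual fraction obtained for the subset chosen by the ordering procedure.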

The experiment shows that the algorithm we propose performs decidedly
better than a random choice if we want to choose ten out of thirty questions
in such a way that the total variance of the variable X is best explained.
In conclusion there is at least some experimental evidence that our method
may be used to select a small number of characters which account for most
of the variance.

It is interesting to compare our results with the results obtained with linear
regression. We found that the agreement between the results is strong. Eight
of the ten variables selected by SOO are among the ten most important
variables in terms of linear regression. Furthermore, the order of the first five
variables is the same under both methods. We also observed that the variables
selected according to SOO have the property of discriminating among the students
(the difference between the percentages of correct and incorrect answers is small).



3 The variable "delay in completing a degree"

The Italian system of higher education is characterized by the marked
difference between the time employed by most students to complete a degree
and the number of years formally required to graduate. The average delay
in completing a degree is well above two years for most fields of study^1. In
this section we consider a population of Italian university graduates obtained
using the data bank "AlmaLaurea", which collects data of university
graduates from a set of Italian universities^2. The population amounts to 58,091
graduates of 27 universities in 2003. On this population the variable X
represents the delay in completing the degree, computed in years, starting from
a conventional date (November 1st) by which, according to formal regulations,
the degree should have been completed. We excluded delays above ten years,
which are better interpreted as leaving and resuming the studies after
several years.

^1 The recent reform of the university system may hopefully change this in the near
future.

^2 The AlmaLaurea Consortium is an association of 49 Italian universities which, since
1994, has collected statistical data about the scholastic and employment records of
university graduates [3]. The data bank of AlmaLaurea is also made available, under
certain conditions, to prospective employers.

We study the dependence of X on seven possible characters,
which are the following:

(UN) University where the degree was obtained

(PE) Parents' level of education

(HS) Type of high school attended

(GD) Grade in the final year of high school

(MA) Degree major

(WO) Working or not working during the studies

(GN) Gender

Proceeding as outlined in the introduction, we obtain the following ranking
of the seven variables:

GD, UN, MA, HS, PE, WO, GN.



Accordingly we consider the operators

$E_0, E_1, E_2, E_3, E_4, E_5, E_6, E_7$,

and write

$$V(X) = \sum_{j=1}^{7} \|E_j(X) - E_{j-1}(X)\|^2 + \|X - E_7(X)\|^2. \qquad (5)$$

The variance of the variable X is V(X) = 4.61, while the residual variance,
not "explained" by the qualitative variables under consideration, is
$\|X - E_7(X)\|^2 = 1.94$. The decomposition of the variance (3) is:

4.61 = (0.30 + 0.28 + 0.49 + 0.45 + 0.51 + 0.33 + 0.31) + 1.94 = 2.67 + 1.94

Thus 2.67 represents the portion of the variance which is "explained" by the
characters considered. We may say, therefore, that these characters explain
about 58% of the variance (2.67 out of 4.61).

In this case the ranking obtained by our method is relatively "robust". Indeed
if we omit consideration of one of the characters, the relative ranking of the
other characters remains unchanged. We do not claim of course that this
type of "robustness" is inherent in our method. It may very well occur, with
different data, that omitting one character would determine a change in the
order of the remaining characters.

We compared our results with the results obtained by using binomial
logistic regression. The delay in obtaining the degree is made dichotomous by
assigning value zero to the population of graduates with a delay of less than
one year (34.1%) and value one to the others (65.9%). The results of our
computations are shown in Table 1.

We observe that also in this case the rank in terms of size of the variances
coincides, except for one inversion, with the ranking obtained by SOO. It
should be noted however that the application of binomial logistic regression
implies an arbitrary dichotomization of the variable. Moreover, it is
questionable in this case that binomial logistic regression would add information
of inferential value, because its application leads to many classes which are
empty or contain few individuals.
Variable    Variance    Variance (GD = 100)
GD          0.0119      100.0
UN          0.0076       63.6
MA          0.0097       81.7
HS          0.0036       30.2
PE          0.0012       10.3
WO          0.0010        8.4
GN          0.0001        0.5

Table 1: Binomial logistic regression of the seven variables. The column
"Variance" reports the variance of the probability variation.
Figure 3: Histogram of the delay.

4 A simulated experiment

In order to better understand the properties of our Stepwise Optimal Ordering,
we performed a simulation, repeating 20 times the following experiment.
First we constructed 10 vectors $X_1, \dots, X_{10}$, each of 100 components, each
component drawn from a simulated Bernoulli variable. Then we
considered the variable

$$X = c_1 X_1 + c_2 X_2 + \dots + c_{10} X_{10} + \varepsilon \qquad (6)$$

with $c_1 = 1$, $c_2 = 0.9$, ..., $c_{10} = 0.1$ and $\varepsilon$ consisting of 100 independent
realizations of a simulated Gaussian variable with mean 0 and standard deviation
0.03.

In 18 cases out of the 20 observed experiments, the SOO was exactly 1, 2, 3, ..., 10,
i.e. for the variable X this order reflected, most of the time, the size of the
coefficients $c_1, \dots, c_{10}$ which enter formula (6). In the remaining two cases the
difference between the SOO and the increasing order was just one inversion.
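An experiment of this type can be reproduced along the following lines in plain Python. Several details are assumptions not fixed by the text: the Bernoulli parameter is taken to be 0.5, the random seed is arbitrary, ties in the greedy step are broken in favour of the lowest index, and the function names `soo` and `one_run` are illustrative. The greedy step is restated inside the snippet so that it runs on its own, and the counts it produces need not match the 18 out of 20 reported above.

```python
import random
from collections import defaultdict


def soo(x, columns):
    """Greedy Stepwise Optimal Ordering; returns the column indices in order."""
    n = len(x)
    keys = [()] * n                  # trivial partition pi_0
    current = [sum(x) / n] * n       # E_0(X)
    remaining = list(range(len(columns)))
    order = []
    while remaining:
        best_gain, best = -1.0, None
        for j in remaining:
            cand_keys = [k + (c,) for k, c in zip(keys, columns[j])]
            groups = defaultdict(list)
            for v, k in zip(x, cand_keys):
                groups[k].append(v)
            means = {k: sum(vs) / len(vs) for k, vs in groups.items()}
            cand = [means[k] for k in cand_keys]
            gain = sum((a - b) ** 2 for a, b in zip(cand, current)) / n
            if gain > best_gain:     # strict inequality: lowest index wins ties
                best_gain, best = gain, (j, cand_keys, cand)
        j, keys, current = best
        order.append(j)
        remaining.remove(j)
    return order


def one_run(rng, n=100, p=0.5, sigma=0.03):
    """One experiment following formula (6); p = 0.5 is an assumption."""
    coeffs = [1.0 - 0.1 * j for j in range(10)]   # c_1 = 1, ..., c_10 = 0.1
    columns = [[1 if rng.random() < p else 0 for _ in range(n)]
               for _ in coeffs]
    x = [sum(c * col[i] for c, col in zip(coeffs, columns))
         + rng.gauss(0.0, sigma)
         for i in range(n)]
    return soo(x, columns)


rng = random.Random(2024)
orders = [one_run(rng) for _ in range(20)]
# Number of runs whose ordering is exactly 1, 2, ..., 10 (indices 0..9 here).
exact = sum(order == list(range(10)) for order in orders)
```

With 100 individuals and ten dyadic characters the partition can reach single-individual classes before all characters are used; from that point every remaining gain is zero, so the tail of the ordering is decided by the tie-breaking rule alone.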